In the next two hours we are going to explore some of the ways unstructured data can be plotted using R. The list of plot is by no means exhaustive and it is just a short overview of possible options. A very good starting point to explore graphs remains https://r-graph-gallery.com/
Let’s say that you are exploring the results of OCR’d documents and you want to check where the low percentage of confidentiality of in OCR it comes from and if it cluster around certain areas Let’s see the original file:
image1 <- image_read("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/page10.jpg")
image1
ok this seems quite a well defined images let’s see the results of the OCR on it
PrideAndPrejudiceCh10<- read_csv("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/OCRPrideAndPrejudice.csv")
mean(PrideAndPrejudiceCh10$confidence)
## [1] 94.59075
I can do that by plotting the ocr confidence level on top of the original page
ggplot(PrideAndPrejudiceCh10, aes(MeanX, Meany, colour= confidence)) + #Colour code by confidence
background_image(image1)+ # Select the image I want as a background
geom_point(shape=15, size=5, alpha=0.5)+#plot them as squared points
coord_cartesian(xlim = c(0, 2481),ylim = c(3508, 0),
expand = FALSE)+ #Use the sizes of the image to scale the graph
scale_colour_continuous(low="red", high="green") + #Create a continuous scale from green to red to colour-code the results
theme_bw()# b/w background
First I import the image
image2 <- image_read("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/page1.jpg")
image2
Ok this seems quite a well defined images let’s see the results of the OCR on it
Again I am importing them from GitHub
PrideAndPrejudiceIncipit<- read_csv("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/PrideAndPrejudiceIncipit.csv")
mean(PrideAndPrejudiceIncipit$confidence)
## [1] 42.67584
That is not good at all but can we see where the issues are through data visualisation
We can do so by yet again plotting it
ggplot(PrideAndPrejudiceIncipit, aes(MeanX, Meany, colour= confidence)) + #Colour code by confidence
background_image(image2)+ # Select the image I want as a background
geom_point(shape=15, size=5, alpha=0.5)+#plot them as squared points
coord_cartesian(xlim = c(0, 3307),ylim = c(4677, 0),
expand = FALSE)+ #Use the sizes of the image to scale the graph
scale_colour_continuous(low="red", high="green") + #Create a continuous scale from green to red to colour-code the results
theme_bw()# b/w background
This is what you can do with OCR data let’s now look at proper textual data
Before being able to plot unstructured data there is some cleaning and setting up that is always needed.
So let’s go through step by step:
We do so yet again from the GitHub
ScotAccount<- read_csv("https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/ScotlandParishes.csv")
We do that by running summary
summary(ScotAccount)
## title text Type TypeDescriptive
## Length:27065 Length:27065 Length:27065 Length:27065
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## RecordID Area Parish Year
## Length:27065 Length:27065 Length:27065 Length:27065
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Tables
## Length:27065
## Class :character
## Mode :character
This is a very large dataset
The ‘Old’Statistical Account(1791-99), under the direction of Sir John Sinclair of Ulbster, and the ‘New’ Statistical Account(1834-45) are reports of life during.
They offer uniquely rich and detailed parish reports for the whole of Scotland, covering a vast range of topics including agriculture, education, trades, religion and social customs.
https://stataccscot.edina.ac.uk/static/statacc/dist/home
Everything from changing fashions in dress to the different attitudes to smallpox inoculation and resulting high infant mortality between the north and south of Scotland
Our datasets are 27065 records corresponding to single reports from the statistical accounts about a certain Parish.
A quanteda corpus is a special type of data structure used to represent a collection of texts. It is designed to facilitate the analysis of textual data in a quantitative manner. The quanteda package provides functions for creating, manipulating, and analyzing corpora.
If you do not specify a column the system will search for a variable named text if you do not have one you need to specify the name of column that contain the text you want to analyse.
ScotAccountCorpus<-corpus(ScotAccount)
“Tokenize” is a term used in natural language processing (NLP) and text analysis to refer to the process of breaking down a text into individual units, which are often referred to as “tokens.” Tokens can be words, subwords, or other units of text, depending on the level of granularity desired for analysis.
Definition: Breaking a text into individual words. Example: “The quick brown fox” would be tokenized into [“The”, “quick”, “brown”, “fox”].
Definition: Breaking a text into individual sentences. Example: “This is the first sentence. And this is the second.” would be tokenized into [“This is the first sentence.”, “And this is the second.”].
Tokenization is a fundamental step in many natural language processing tasks because it helps to convert raw text into a format that can be easily processed and analyzed by algorithms. Once tokenized, the resulting units can be used for tasks such as counting word frequencies, training machine learning models, or extracting linguistic features for further analysis.
In a dfm, each row corresponds to a document, and each column corresponds to a feature (typically a term or word). The matrix entries represent the frequency of each feature in each document. The term “feature” is used more broadly than “term” because the features can include not only individual words but also phrases or other linguistic units.
Let’s do these steps on our dataset
dfmat_ScotAccountCorpus <- corpus_subset(ScotAccountCorpus) |> # which data
tokens(remove_punct = TRUE,
remove_numbers = TRUE,
remove_symbols=TRUE) |> # Tokenise
tokens_select(min_nchar = 3)|> # Refine tokens by removing words shorter than 3 characters
tokens_remove(pattern = c(stopwords('english'), "Parish")) |> # Remove common + custom stopwords
dfm()|> # create a Document Feature Matrix
dfm_trim(min_termfreq = 2000, verbose = FALSE) # remove all words that are recurring less than 2000 times. Verbose hide the result of this step from the console output this is just to speed up the process.
Stop words are common words that are often filtered out during the preprocessing of natural language text data before analysis. These words are generally the most common words in a language and are considered to carry little or no meaningful information about the content of the text. Including stop words in text analysis can introduce noise and may not contribute much to the understanding of the underlying patterns.
Common examples of stop words in English include words like “the,” “and,” “is,” “in,” “to,” and so on. The specific list of stop words can vary depending on the context and the task at hand.
Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud
Let’s start from a simple one
set.seed(100) # to create reproducible results when writing code
textplot_wordcloud(dfmat_ScotAccountCorpus) # Plot our wordcloud
if we want to compare across specific areas of Scotland First I need to process the corpus again but I am looking only at data from Fife, Edinburgh, Haddington and Stirling.
CorpusSubset<-corpus_subset(ScotAccountCorpus, # which data
Area %in% c("Fife", "Edinburgh", "Haddington", "Stirling")) |> # select only some data from the Area column, only those from Fife, Edinburgh, Haddington and Stirling
tokens(remove_punct = TRUE,
remove_numbers = TRUE,
remove_symbols=TRUE) |> # tokenise
tokens_select(min_nchar = 3)|> # Refine tokens by removing words shorter than 3 characters
tokens_remove(pattern = c(stopwords('english'), "Parish", "Edinburgh", "Fife", "Haddington", "Stirling")) |> # remove English stopwords plus costum ones
dfm() |>
dfm_group(groups = Area) |> # create a Document Feature Matrix
dfm_trim(min_termfreq =500, verbose = FALSE) # remove all words that are recurring less than 500 times. Verbose hide the result of this step from the console output this is just to speed up the process.
set.seed(100) # to create reproducible results when writing code
textplot_wordcloud(CorpusSubset,comparison = TRUE) # Comparison true would divide the cloud by the areas nb max number of cluster is 8.
Let’s refine the colours
textplot_wordcloud(CorpusSubset,comparison = TRUE,
color = c("red","purple", "green", "blue"),# Select colours
max_words = 100, # max number of words displayed
min_size = 0.8,# min size font
max_size = 5)# max size font
We can also be fancy and we can shape the wordcloud onto the shape of Scotland
To do so we need to use a library ggwordcloud. this works with dataframes so we need to transform our DFM into a dataframe containing the frequency of words we can do this by extracting the frequency (you will need quanteda.textstats for this as well)
tstat_freq_Stats <- textstat_frequency(dfmat_ScotAccountCorpus, n = 75) # extracting the top 75 words
Then we get the image that we want to use in this case the shape of Scotland
# First we get the image
image_url <- "https://raw.githubusercontent.com/DCS-training/Good-Data-Visualisation-with-R/main/DataClass3/Scotland.png"
# Download the image to a local file
download.file(url = image_url, destfile = "scotland.png", mode = "wb")
set.seed(42)# Again this is to make it reproducible
# We can look at it
Scotland<-image_read("scotland.png")
Scotland
# Finally we define the image path
image_path <- here::here("Scotland.png") # specify where the image is (need the here library) Need to be a Png for handling transparency
ggplot(tstat_freq_Stats, aes(label = feature, size = frequency, colour = frequency)) +
geom_text_wordcloud_area(
mask = png::readPNG(image_path),
rm_outside = TRUE # This should remove text outside the mask area
) +
scale_size_area(max_size =35) +
scale_color_gradient(low = "darkblue", high = "blue") +
theme_minimal()
We extracted the frequency now let’s look at other ways to plot it (i.e. what are the most recurrent words in a series of texts)
ggplot(tstat_freq_Stats, aes(x = frequency, y = reorder(feature, frequency))) +
geom_point() +
labs(x = "Frequency", y = "Feature")+
theme_bw()
If you wanted to compare the frequency of a single term across different texts, you can also use textstat_frequency, group the frequency by Area and extract the term.
First we re-tokenise but without do a dfm
toks_corpus_Scot_subset <-
corpus_subset(ScotAccountCorpus) |> # Ceci n'est pas une pipe...
tokens()
freq_grouped <- textstat_frequency(dfm(toks_corpus_Scot_subset),
groups = Area)
freq_church <- subset(freq_grouped, freq_grouped$feature %in% "church")
ggplot(freq_church, aes(x = frequency, y = group)) +
geom_point() +
scale_x_continuous(limits = c(0, 1750), breaks = c(seq(0, 1750, 150))) +
labs(x = "Frequency", y = NULL,
title = 'Frequency of "Church"')+
theme_bw()
For this one we need to use a different dataset cause the one we were working on is too big.
So we are using the inaugurations speaches of the USA presidents.
The dataset is a default dataset that can be used directly
toks_corpus_inaugural_subset <-
corpus_subset(data_corpus_inaugural, Year > 1949) |>
tokens()
The term “keyword in context” (KWIC) refers to a method of displaying specific words or phrases in their surrounding context, typically within a larger body of text. This technique is commonly used in linguistics, information retrieval, and text analysis to better understand how certain words are used and the context in which they appear.
In a KWIC display, a specific word or phrase (the “keyword”) is centered in the middle of the display, and the text snippets or lines containing that keyword are shown in context, usually to the left and right of the keyword. This provides a quick and concise way to observe the usage patterns and surrounding words of a particular term.
kwic(toks_corpus_inaugural_subset, pattern = "american") |>
textplot_xray()
For the last bit we go back to our original dataset and we check how big the dataset is
It is always useful to extract information about the corpus we are working with Some methods for extracting information about the corpus:
# to explore this we focus on the text column of the ScotAccount
CorpusStat<-corpus(ScotAccount$text)
# Print doc in position 5 of the corpus
summary(CorpusStat, 5)
## Corpus consisting of 27065 documents, showing 5 documents:
##
## Text Types Tokens Sentences
## text1 94 171 1
## text2 167 399 1
## text3 122 201 1
## text4 123 262 1
## text5 146 273 1
# Check how many docs are in the corpus
ndoc(CorpusStat)
## [1] 27065
# Check number of characters in the first 10 documents of the corpus
nchar(CorpusStat[1:10])
## text1 text2 text3 text4 text5 text6 text7 text8 text9 text10
## 940 1906 1011 1245 1354 1422 1995 1916 1855 1977
# Check number of tokens in the first 10 documents
ntoken(CorpusStat[1:10])
## text1 text2 text3 text4 text5 text6 text7 text8 text9 text10
## 171 399 201 262 273 281 422 387 365 454
Can we create some better visualisation to look into it?
Yes of course.
Create a new vector with tokens for all articles and store the vector as a new data frame with three columns (Ntoken, Dataset, Date).
NtokenStats<-as.vector(ntoken(CorpusStat))
TokenScotland <-data.frame(Tokens=NtokenStats, title=ScotAccount$title, Area=ScotAccount$Area, Parish=ScotAccount$Parish)
Now we want to see how much material we have for each area. We can do that through pipes
BreakoutScotland<- TokenScotland |>
group_by(Area)|>
summarize(NReports=n(), MeanTokens=round(mean(Tokens)))
Now we can plot the trends.
This is done through the use of the ggplot package that is a very handy package that will allow you to print a very big variety of graphs
ggplot(BreakoutScotland, aes(x=Area, y=NReports))+ # Select data set and coordinates we are going to plot
geom_point(aes(size=MeanTokens, fill=MeanTokens),shape=21, stroke=1.5, alpha=0.9, colour="black")+ # Which graph I want
labs(x = "Areas", y = "Number of Reports", fill = "Mean of Tokens", size="Mean of Tokens", title="Number of Reports and Tokens in the Scotland Archive")+ # Rename labs and title
scale_size_continuous(range = c(5, 15))+ # Resize the dots to be bigger
geom_text(aes(label=MeanTokens))+ # Add the mean of tokens in the dots
scale_fill_viridis_c(option = "plasma")+ # Change the colour coding
theme_bw()+ # B/W Background
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position = "bottom")+ # Rotate labels of x and move them slightly down. Plus move the position to the bottom
guides(size = "none") # Remove the Size from the Legend